
fix: Prevent premature EventManager shutdown when multiple crawlers share it #1810

Draft
Mantisus wants to merge 8 commits into apify:master from Mantisus:fix-global-event-manager

Conversation


@Mantisus Mantisus commented Mar 24, 2026

Description

  • Fixed a bug where the global EventManager was shut down prematurely when the first of multiple concurrent crawlers finished, leaving the remaining crawlers with a broken event system.
  • The crawler now always starts the global event_manager, even if an event_manager argument was passed to the crawler's constructor.
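The failure mode can be illustrated with a minimal, crawlee-free sketch (all names here are illustrative, not crawlee's API): a shared async context manager that tears down on the first exit leaves any still-running user with a dead component.

```python
import asyncio

class NaiveEventManager:
    """Illustrative stand-in, not crawlee's API: tears down on the FIRST exit."""

    def __init__(self) -> None:
        self.active = False

    async def __aenter__(self) -> "NaiveEventManager":
        self.active = True
        return self

    async def __aexit__(self, *exc: object) -> None:
        self.active = False  # the first crawler to finish kills it for everyone

async def crawler(manager: NaiveEventManager, duration: float, results: list) -> None:
    async with manager:
        await asyncio.sleep(duration)
        results.append(manager.active)  # should still be True while we run

async def main() -> list:
    manager = NaiveEventManager()
    results: list = []
    await asyncio.gather(
        crawler(manager, 0.01, results),
        crawler(manager, 0.05, results),
    )
    return results

print(asyncio.run(main()))  # → [True, False]: the slower crawler saw a dead manager
```

With ref counting (as this PR introduces), the second entry would not be the one to tear the manager down, and both crawlers would observe an active manager.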

Issues

Testing

  • Added a test verifying that the global EventManager remains active as long as at least one crawler is running, and is shut down only after the last crawler finishes.
  • Added a test to verify that both event_manager instances are active while the crawler is running.

@Mantisus Mantisus requested review from Pijukatel and janbuchar March 24, 2026 20:28
@Mantisus Mantisus self-assigned this Mar 24, 2026

@Pijukatel Pijukatel left a comment


Nice, I just have a few small details

@Mantisus Mantisus requested a review from Pijukatel March 25, 2026 14:54
Comment on lines +777 to +779
contexts_to_enter: list[Any] = (
[global_event_manager, local_event_manager] if local_event_manager else [global_event_manager]
)
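For context, the pattern these quoted lines feed into, selecting which event-manager contexts to enter, can be sketched with contextlib.AsyncExitStack (names illustrative, not crawlee's internals):

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager

@asynccontextmanager
async def named_context(name: str, log: list):
    # Stand-in for an event manager's async context (illustrative only).
    log.append(f"enter {name}")
    try:
        yield name
    finally:
        log.append(f"exit {name}")

async def main() -> list:
    log: list = []
    local = None  # imagine the user passed no event_manager to the constructor
    contexts = [named_context("global", log)]
    if local is not None:
        contexts.append(named_context("local", log))
    async with AsyncExitStack() as stack:
        for ctx in contexts:
            await stack.enter_async_context(ctx)
        log.append("crawl")  # the crawl runs with every selected context active
    return log

print(asyncio.run(main()))  # → ['enter global', 'crawl', 'exit global']
```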

I'm not so sure about this. Why should the crawler initialize a component that it doesn't use? And then tear it down? I think we should try to see a bigger picture first.

The linked issues are caused by Snapshotter and RecoverableState subscribing to an event manager that doesn't emit anything - how does that happen?

Collaborator Author


I would prefer to remove event_manager from the BasicCrawler constructor entirely and keep only the global one, configurable via service_locator.set_event_manager. However, this requires discussion and should probably be part of v2.

> I'm not so sure about this. Why should the crawler initialize a component that it doesn't use? And then tear it down? I think we should try to see a bigger picture first.

The global event_manager being started inside the crawler, while only being used by the crawler's internal components, was an early architectural decision. I assume it was done for a better user experience.

> The linked issues are caused by Snapshotter and RecoverableState subscribing to an event manager that doesn't emit anything - how does that happen?

I believe the root cause is bugs introduced during our work on supporting multiple parallel crawlers:

  • An edge case we didn't account for, demonstrated by test_multiple_crawlers_with_global_event_manager. The first crawler to activate the event_manager is fully responsible for its lifecycle. When the crawler finishes, it tears down the event_manager, which clears all subscriptions, leaving any still-running crawlers without a functioning event_manager.
  • The introduction of an internal _service_locator in BasicCrawler, which caused the user-provided event_manager to be used only for Event.CRAWLER_STATUS in _crawler_state_task, while internal components (Snapshotter, RecoverableState) continued subscribing to the global service_locator, which may never be started.

All of this made the behavior of service_locator.get_event_manager() unstable and unpredictable.

This PR is an attempt to fix these issues within the current architecture without introducing breaking changes.


@Pijukatel Pijukatel Mar 26, 2026


I think the fixes that solve two crawlers using the same EventManager are good. Regarding the global event manager, I see this as a quick fix.

> I would prefer to remove event_manager from the BasicCrawler constructor entirely...

I think this would lead to counterintuitive behavior, because Configuration has EventManager-specific fields that would be ignored in cases like this:

# Global event manager already exists
BasicCrawler(configuration=Configuration(persist_state_interval=...))

Maybe each subscriber to the event manager could take at least an optional init argument that allows passing an explicit event_manager (if it creates storages, then a service_locator argument might be even better) instead of relying on the global one.

Collaborator


When porting the service locator to crawlee-js (apify/crawlee#3325), I actually made the global serviceLocator object into a proxy that resolves to the crawler-scoped ServiceLocator when called from a crawler instance and to the global one otherwise.

If we adopted that in Python as well, it would solve the inconsistency caused by Snapshotter and friends still using the global locator.
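A rough Python transplant of that proxy idea, using contextvars to track the crawler scope. All names here are illustrative, not crawlee's actual API:

```python
import contextvars

# Holds the crawler-scoped locator while a crawler is running, else None.
_current_locator: contextvars.ContextVar = contextvars.ContextVar(
    "current_locator", default=None
)

class ServiceLocator:
    def __init__(self, event_manager: str) -> None:
        self.event_manager = event_manager

class ServiceLocatorProxy:
    """Resolves to the crawler-scoped locator when one is active, else the global."""

    def __init__(self, global_locator: ServiceLocator) -> None:
        self._global = global_locator

    def get_event_manager(self) -> str:
        scoped = _current_locator.get()
        return (scoped or self._global).event_manager

service_locator = ServiceLocatorProxy(ServiceLocator("global-em"))

def run_with_crawler_scope() -> str:
    # A crawler would set its own locator for the duration of its run.
    token = _current_locator.set(ServiceLocator("crawler-em"))
    try:
        return service_locator.get_event_manager()
    finally:
        _current_locator.reset(token)

print(service_locator.get_event_manager())  # → global-em
print(run_with_crawler_scope())             # → crawler-em
```

Components like Snapshotter could then keep calling the module-level service_locator and still end up on the crawler-scoped event manager.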

Collaborator Author


Thanks, I'll study that solution.

Collaborator


Let me know if you get stuck on something; it's kinda magic. But that's the cost of making the library usable without factories, builders and explicit dependency containers.

@Mantisus Mantisus requested a review from vdusek March 26, 2026 00:42
"""Initialize the event manager upon entering the async context."""
self._active_ref_count += 1
if self._active_ref_count > 1:
return self
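The quoted guard is the entry half of a ref-counting scheme; a minimal end-to-end sketch (illustrative, not the actual crawlee code) shows how only the first entry starts and only the last exit stops:

```python
import asyncio

class RefCountedManager:
    """Sketch: first __aenter__ starts, last __aexit__ stops, intermediates no-op."""

    def __init__(self) -> None:
        self._active_ref_count = 0
        self.started = 0
        self.stopped = 0

    async def __aenter__(self) -> "RefCountedManager":
        self._active_ref_count += 1
        if self._active_ref_count > 1:
            return self  # already initialized by an earlier entrant
        self.started += 1  # real code would start recurring tasks here
        return self

    async def __aexit__(self, *exc: object) -> None:
        self._active_ref_count -= 1
        if self._active_ref_count > 0:
            return  # other crawlers still depend on this manager
        self.stopped += 1  # real code would stop tasks and clear listeners

async def main() -> tuple:
    m = RefCountedManager()
    async with m:
        async with m:
            pass  # inner exit must not tear the manager down
    return m.started, m.stopped, m._active_ref_count

print(asyncio.run(main()))  # → (1, 1, 0)
```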
Collaborator


Potential task leak for multi-crawler scenarios; from Claude:

Bug: LocalEventManager not updated for ref-counting — causes task leak

The base class now correctly guards _emit_persist_state_event_rec_task.start() behind ref_count == 1 (line 110). However, LocalEventManager.__aenter__ (in _local_event_manager.py:72-79) unconditionally calls self._emit_system_info_event_rec_task.start() on every entry:

async def __aenter__(self) -> LocalEventManager:
    await super().__aenter__()
    self._emit_system_info_event_rec_task.start()  # Called EVERY time!
    return self

Since RecurringTask.start() creates a new asyncio.Task and overwrites self.task without cancelling the previous one (_utils/recurring_task.py:60-66), the old task is leaked (still running, but no reference to cancel it).

In the multi-crawler scenario with the default shared global LocalEventManager:

  1. Crawler 1 enters → ref_count=1, system_info task A created
  2. Crawler 2 enters → ref_count=2, system_info task B created, task A orphaned
  3. Crawler 1 exits → stop() cancels task B, ref_count=1 — but task A is still running with no reference
  4. Crawler 2 exits → stop() tries to cancel already-cancelled task B — task A never stopped

LocalEventManager needs the same ref-count-aware guards:

async def __aenter__(self) -> LocalEventManager:
    await super().__aenter__()
    if self._active_ref_count == 1:  # Only start on first entry
        self._emit_system_info_event_rec_task.start()
    return self

async def __aexit__(self, exc_type, exc_value, exc_traceback) -> None:
    if self._active_ref_count == 1:  # Only stop on last exit
        await self._emit_system_info_event_rec_task.stop()
    await super().__aexit__(exc_type, exc_value, exc_traceback)

Note: checking == 1 in __aexit__ must happen before super().__aexit__() decrements the count.
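The core of the described leak, a start() that overwrites the task reference without cancelling the old task, can be reproduced in isolation with a minimal stand-in (illustrative, not the actual RecurringTask):

```python
import asyncio

class RecurringTask:
    """Minimal stand-in for the pattern described above: start() overwrites
    self.task without cancelling the previous one."""

    def __init__(self) -> None:
        self.task = None

    async def _run_forever(self) -> None:
        while True:
            await asyncio.sleep(0.01)

    def start(self) -> None:
        # The previous task (if any) is orphaned: still running, no reference.
        self.task = asyncio.ensure_future(self._run_forever())

    async def stop(self) -> None:
        if self.task:
            self.task.cancel()

async def main() -> tuple:
    rec = RecurringTask()
    rec.start()
    first = rec.task
    rec.start()           # second entry: the first task is no longer reachable
    await rec.stop()      # cancels only the second task
    await asyncio.sleep(0.05)
    return first.cancelled(), rec.task.cancelled()

print(asyncio.run(main()))  # → (False, True): the first task was never stopped
```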

# Emit persist state event to ensure the latest state is saved before closing the context.
await self._emit_persist_state_event()
self._active_ref_count -= 1
return
Collaborator


We don't call wait_for_all_listeners_to_complete in this branch.

From Claude:

Consideration: Intermediate exit skips listener cleanup

When ref_count > 1, the exit emits a persist-state event but does not call wait_for_all_listeners_to_complete() or clean up listeners registered by the exiting crawler.

In practice this is likely fine because individual components (Snapshotter, RecoverableState) clean up their own listeners via off() in their __aexit__/teardown(). But any component that relies on the event manager's remove_all_listeners() as the cleanup mechanism would have stale listeners firing after its context is torn down.

Worth adding a comment here explaining this design choice — e.g. that individual components are responsible for their own listener cleanup, and remove_all_listeners() on final exit is just a safety net.
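A sketch of that design choice, with illustrative names: each component removes its own listener via off() in its __aexit__, so remove_all_listeners() on the manager's final exit is only a safety net.

```python
import asyncio

class EventManager:
    """Illustrative minimal event manager (not crawlee's API)."""

    def __init__(self) -> None:
        self.listeners: list = []

    def on(self, listener) -> None:
        self.listeners.append(listener)

    def off(self, listener) -> None:
        self.listeners.remove(listener)

    def remove_all_listeners(self) -> None:
        self.listeners.clear()  # safety net, called only on final exit

class Component:
    """Hypothetical Snapshotter-like component owning its listener lifecycle."""

    def __init__(self, event_manager: EventManager) -> None:
        self._em = event_manager
        self._listener = lambda: None

    async def __aenter__(self) -> "Component":
        self._em.on(self._listener)
        return self

    async def __aexit__(self, *exc: object) -> None:
        self._em.off(self._listener)  # own cleanup; no stale listener remains

async def main() -> tuple:
    em = EventManager()
    async with Component(em):
        mid = len(em.listeners)
    return mid, len(em.listeners)

print(asyncio.run(main()))  # → (1, 0)
```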

Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
@Mantisus Mantisus marked this pull request as draft March 26, 2026 15:33


Development

Successfully merging this pull request may close these issues.

When using event_manager to initialize the crawler, the global event_manager doesn't work. Running inside Docker doesn't take care of persistence

5 participants